Climate change is one of the biggest challenges facing humanity today, resulting among other effects in more frequent and severe weather events. The Federal Emergency Management Agency (FEMA) is part of the U.S. Department of Homeland Security and is responsible for coordinating the response to disasters and emergencies, both natural and man-made. FEMA supports state and local governments, as well as individuals affected by disasters, by offering financial assistance, providing temporary housing, and assisting with recovery efforts. The agency also plays a key role in disaster preparedness and risk reduction through programs that help communities develop plans and take steps to reduce the impact of future disasters.
FEMA provides various data sets on past disasters, some going back to the 1960s. For our project, we merged two data sets from FEMA and enriched them with additional information such as geographical data (e.g. longitude, latitude) and population size per state from Wikipedia. The first FEMA data set contains, for example, information on the type of disaster (fire, flooding, etc.), the state in which the disaster occurred, and the dates when it occurred. From the second data set, we pulled information on the amount of damage each disaster caused, measured as a monetary amount. Since the second data set only goes back to 2002, we restricted our analysis to the past 20 years.
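The merging step can be sketched as follows. This is a minimal illustration with invented column names and values; the real FEMA exports use different field names and many more columns.

```r
# Hypothetical column names and values for illustration only
declarations <- data.frame(DisasterNumber = c(1971, 4655),
                           Type  = c("Severe Storm", "Flood"),
                           State = c("Alabama", "Montana"))
damages <- data.frame(DisasterNumber = c(1971, 4655),
                      Damage = c(468614.48, 2901.34))

# Join the two FEMA data sets on the shared disaster number
df <- merge(declarations, damages, by = "DisasterNumber")

# Enrich with per-state information taken from Wikipedia (illustrative values)
state_info <- data.frame(State = c("Alabama", "Montana"),
                         Population = c(96137, 16513))
df <- merge(df, state_info, by = "State")
```

The same joins could of course be written with `dplyr::left_join()`; base `merge()` keeps the sketch dependency-free.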
After merging and cleaning the data from the four sources, our final data set for the analysis looked like this:
| Disaster# | Type | County | State | Latitude | Longitude | Damage | Approved | Repair | Population | IncomeCapita | DisasterBegin | DisasterEnd | Duration |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1971 | Severe Storm | Marshall | Alabama | 34.3670 | -86.3066 | 468614.48 | 116003.84 | 101089.45 | 96137 | 20382 | 2011-04-15 | 2011-05-31 | 46 days |
| 4655 | Flood | Park | Montana | 45.4884 | -110.5267 | 2901.34 | 12532.99 | 11078.99 | 16513 | 24611 | 2022-06-10 | 2022-07-05 | 25 days |
| 4637 | Tornado | Cheatham | Tennessee | 36.2612 | -87.0868 | 7753.23 | 14653.23 | 6344.23 | 40539 | 23459 | 2021-12-10 | 2021-12-11 | 1 days |
| 4633 | Tornado | Jackson | Arkansas | 35.5992 | -91.2145 | 199.50 | 199.50 | 0.00 | 16908 | 15609 | 2021-12-10 | 2021-12-11 | 1 days |
| 4630 | Tornado | Lyon | Kentucky | 37.0191 | -88.0833 | 0.00 | 0.00 | 0.00 | 8226 | 22123 | 2021-12-10 | 2021-12-11 | 1 days |
| 4618 | Hurricane | Bucks | Pennsylvania | 40.3370 | -75.1069 | 92390.42 | 52031.76 | 42999.73 | 627668 | 37466 | 2021-08-31 | 2021-09-05 | 5 days |
Displaying the cumulative frequencies of each disaster type relative to each other, we can see from the stacked bar chart that “severe storms” are by far the most common disaster type, followed by hurricanes and biological disasters.
Various insights can be gained from the time plot below. For example, severe storms happen quite frequently in general, except for the years around 2013–2015 and 2017–2018. Furthermore, only one volcanic eruption and one severe ice storm were recorded in the displayed time period. Dam or levee breaks and biological disasters did not occur between 2000 and 2023. We could hypothesize that fires lasted longer in the years before about 2013 than afterwards, perhaps due to more efficient methods developed to extinguish them. However, this plot shows the durations of almost 500 disasters, which may overlap. Since using 500 different colors for the individual disasters would not help much, we would switch to a different way of showing this information, for example an interactive graphic with Shiny.
Exploring the disaster types a little further, we can see in the boxplot below that fires show the largest variation in duration. The disaster type with the largest median duration is volcanic eruptions. However, this observation should be analyzed further, since there appears to be only one incident in the analysed time frame (only the median is shown, with no visible variation, which means there is only one data point in this group). Outliers with very long durations can be detected in the groups of floods, hurricanes, severe storms, and tornados.
To analyse the data on damage, we first took a look at the distribution of the data points. The histogram indicated a right skewed distribution.
Also, the differences between the individual damage amounts are quite large, which makes the data hard to work with. Therefore, we took the logarithm of the data, which yielded an approximately normal distribution:
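A minimal sketch of this transformation (example values rather than the full data set; zeros must be excluded, since `log(0)` is `-Inf`):

```r
# Illustrative damage amounts; in the report this is df.prep.5$Damage
damage <- c(468614.48, 2901.34, 7753.23, 199.50, 92390.42)

# Drop zero amounts before taking the logarithm: log(0) would be -Inf
log_damage <- log(damage[damage > 0])

hist(log_damage, main = "log(Damage)", xlab = "log(Damage)")
```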
With the transformed data, we could generate a plot showing the damage disasters have caused per state in the US over the past 20 years:
The following map shows the population per state. Similar to above, we are using the log-transformed data.
How much do the estimated damage and the approved amount differ? As before, we worked with log-transformed data. By displaying overlapping histograms, we can conclude that there are a few instances where the approved amount is much lower than the damage that actually occurred. Vice versa, there are instances where the approved amount exceeded the actual damage.
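One way to draw such overlapping histograms is with semi-transparent colors; this is a base-R sketch on example values (the report itself uses ggplot2, and the column names follow the table above):

```r
# Illustrative log-transformed amounts; in the report these come from
# df.prep.5$Damage and df.prep.5$Approved
damage   <- log(c(468614, 2901, 7753, 92390))
approved <- log(c(116003, 12532, 14653, 52031))

# Overlay two histograms with transparency so both distributions stay visible
hist(damage, col = rgb(1, 0, 0, 0.4), xlim = range(c(damage, approved)),
     main = "Damage vs. Approved (log)", xlab = "log(amount)")
hist(approved, col = rgb(0, 0, 1, 0.4), add = TRUE)
legend("topright", legend = c("Damage", "Approved"),
       fill = c(rgb(1, 0, 0, 0.4), rgb(0, 0, 1, 0.4)), bty = "n")
```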
When comparing the amount of money that was approved to repair the damage with the actual costs spent on reconstructions, we can see that in almost all cases, less money was spent than approved.
Diving into this in a bit more detail, we can distinguish between disaster types. For some, like tornados or severe ice storms, the approved amounts exceed the spent amounts far more compared to others like earthquakes or floods where the two amounts are much more equal.
boxplot.repair.df <- data.frame(ApprovedLog = log(df.prep.5$Approved),
RepairLog = log(df.prep.5$Repair),
Type = df.prep.5$Type)
boxplot.repair.num.df <- boxplot.repair.df %>%
filter_if(~is.numeric(.), all_vars(!is.infinite(.)))
b.rep.inf <- boxplot.repair.num.df %>%
pivot_longer(ApprovedLog:RepairLog, names_to = "Names", values_to = "values") %>%
ggplot(aes(y = values, x = Type, fill = Names))+
geom_boxplot() +
facet_wrap(~Type, scale="free")+
ggtitle('Approved and Repair Amount log-transformed without Inf Rows')
b.rep.inf + theme(
plot.title = element_text(size=12, face="bold.italic"),
axis.text.x=element_blank())
b.rep <- boxplot.repair.df %>%
pivot_longer(ApprovedLog:RepairLog, names_to = "Names", values_to = "values") %>%
ggplot(aes(y = values, x = Type, fill = Names))+
geom_boxplot() +
facet_wrap(~Type, scale="free")+
ggtitle('Approved and Repair Amount log-transformed with Inf Rows')
b.rep+
theme(
plot.title = element_text(color = 'red',size=12, face="bold.italic"),
axis.text.x=element_blank())
# ggarrange(b.rep.inf, b.rep,
# labels = c("With Infinite Numbers", "Without Infinite Number"),
# ncol = 1, nrow = 2)
The Shiny dashboard provides an interactive way to further explore aspects of the natural disasters and where they occur: https://milicapajkic.shinyapps.io/Dashboard/
Duration of Disaster Type by State and County
Heatmap
The goal of this model-fitting chapter is to see whether linear regression is an appropriate tool to predict/explain the damage amount per state.
The sum of the damage across each state shows slight right skewness. Therefore, we performed a log transformation, after which the distribution is closer to normal.
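The per-state aggregation can be sketched like this (example rows standing in for `df.prep.5`, with the column names from the table above):

```r
# Example rows; in the report this is df.prep.5
df <- data.frame(State  = c("Alabama", "Alabama", "Montana"),
                 Damage = c(468614.48, 7753.23, 2901.34))

# Sum the damage per state, then log-transform the right-skewed totals
state_damage <- aggregate(Damage ~ State, data = df, FUN = sum)
state_damage$log_damage <- log(state_damage$Damage)
```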
In this chapter, we want to see whether the amount of damage can be predicted from the given data points.
H1: With a higher income per capita, the damage will be lower.
The reasoning is that with a higher income per capita, preparation and precautions for natural disasters are better. For one, there is no monetary strain as in counties with a lower income per capita.
H2: The damage depends on which type of natural catastrophe occurs.
First, we need to look at the distribution of the important variables. It appears that damage and income per capita need to be log-transformed, so we applied the logarithm to both.
The scatter plot indicates that there is a positive relationship between income per capita and damage.
To see whether the chosen variables have an influence on damage, we performed model selection with drop1(). The full model indicates that only log.income, Type, and Duration have relevant effects on the dependent variable damage.
ggplot(data = df.prep.5,
mapping = aes(x = IncomeCapita, fill = '')) +
geom_histogram(show.legend = FALSE)+
labs(x = 'Income', y = 'Count') +
scale_fill_manual(values = c('lightblue2'))+
theme(axis.text=element_text(size=12),
axis.title=element_text(size=18,face="bold"))
ggplot(data = df.prep.5,
mapping = aes(x = log(IncomeCapita), fill = '')) +
geom_histogram(show.legend = FALSE)+
labs(x = 'log. Income', y = 'Count') +
scale_fill_manual(values = c('mistyrose3'))+
theme(axis.text=element_text(size=12),
axis.title=element_text(size=18,face="bold"))
plot(df.prep.5$Damage, df.prep.5$IncomeCapita)
plot(log(df.prep.5$Damage), log(df.prep.5$IncomeCapita))
ggplot(df.prep.5, aes(log(IncomeCapita),log(Damage))) +
geom_point() +
geom_smooth(method = "lm", se = FALSE)
df.model <- df.prep.5
df.model$log.damage <- log(df.model$Damage)
df.model$log.income <- log(df.model$IncomeCapita)
# Drop rows where log(Damage) is NA or -Inf (i.e. Damage was zero or missing)
df.model <- df.model[!is.na(df.model$log.damage) & is.finite(df.model$log.damage), ]
# Simple regression of log damage on log income
model <- lm(log.damage ~ log.income, data = na.omit(df.model))
summary(model)
summary(model)
##
## Call:
## lm(formula = log.damage ~ log.income, data = na.omit(df.model))
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.2833 -1.7712 -0.1211 1.6129 8.7048
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.5667 5.2192 -0.683 0.4949
## log.income 1.2822 0.5184 2.473 0.0139 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.421 on 309 degrees of freedom
## Multiple R-squared: 0.01941, Adjusted R-squared: 0.01624
## F-statistic: 6.117 on 1 and 309 DF, p-value: 0.01392
In this model we fit all the variables and afterwards apply drop1() to see which variables have a significant influence on the dependent variable. We then drop the variables with no significant influence on damage. The final model includes income, type of disaster, and duration.
model.full <- lm(log.damage ~ log.income + State + Type + Population + Duration, data = na.omit(df.model))
summary(model.full)
##
## Call:
## lm(formula = log.damage ~ log.income + State + Type + Population +
## Duration, data = na.omit(df.model))
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.364 -1.379 0.000 1.358 6.059
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4.882e+00 7.277e+00 -0.671 0.5029
## log.income 1.451e+00 7.046e-01 2.060 0.0405 *
## StateAlaska -1.311e+00 1.723e+00 -0.761 0.4473
## StateArkansas 9.686e-01 9.610e-01 1.008 0.3145
## StateCalifornia 8.646e-01 1.401e+00 0.617 0.5377
## StateColorado -1.704e+00 1.760e+00 -0.969 0.3337
## StateConnecticut -4.361e-01 1.397e+00 -0.312 0.7551
## StateDelaware -1.544e+00 2.424e+00 -0.637 0.5246
## StateFlorida 1.421e-01 8.456e-01 0.168 0.8667
## StateGeorgia -8.184e-01 9.437e-01 -0.867 0.3867
## StateHawaii -4.165e-01 1.550e+00 -0.269 0.7884
## StateIllinois 3.577e-01 9.325e-01 0.384 0.7016
## StateIndiana -3.766e-01 9.327e-01 -0.404 0.6867
## StateIowa -4.459e-01 1.331e+00 -0.335 0.7379
## StateKansas 4.305e-02 1.475e+00 0.029 0.9767
## StateKentucky -9.493e-02 8.535e-01 -0.111 0.9115
## StateLouisiana 1.670e-01 8.970e-01 0.186 0.8524
## StateMaine -1.158e+00 1.760e+00 -0.658 0.5111
## StateMaryland 1.438e+00 1.782e+00 0.807 0.4205
## StateMassachusetts -9.197e-01 1.255e+00 -0.733 0.4645
## StateMichigan 4.720e-01 1.504e+00 0.314 0.7539
## StateMinnesota -1.209e+00 1.336e+00 -0.905 0.3662
## StateMississippi 9.258e-01 7.989e-01 1.159 0.2476
## StateMissouri -3.151e-01 9.981e-01 -0.316 0.7525
## StateMontana -1.686e+00 1.797e+00 -0.938 0.3492
## StateNebraska -2.212e+00 1.487e+00 -1.487 0.1382
## StateNevada 5.371e+00 2.382e+00 2.254 0.0250 *
## StateNew Hampshire -1.793e+00 1.353e+00 -1.325 0.1864
## StateNew Jersey -8.625e-01 1.095e+00 -0.788 0.4317
## StateNew Mexico 5.113e+00 2.382e+00 2.147 0.0328 *
## StateNew York -1.380e+00 9.197e-01 -1.501 0.1346
## StateNorth Carolina -2.630e-01 1.084e+00 -0.243 0.8086
## StateNorth Dakota -7.434e-01 1.839e+00 -0.404 0.6863
## StateOhio -4.104e-02 1.001e+00 -0.041 0.9673
## StateOklahoma -1.962e-01 9.490e-01 -0.207 0.8364
## StateOregon -5.421e-02 1.755e+00 -0.031 0.9754
## StatePennsylvania -7.499e-03 1.166e+00 -0.006 0.9949
## StateRhode Island 2.189e-01 2.450e+00 0.089 0.9289
## StateSouth Carolina 1.276e+00 1.321e+00 0.965 0.3353
## StateSouth Dakota -1.277e+00 1.369e+00 -0.933 0.3518
## StateTennessee -7.755e-01 9.151e-01 -0.847 0.3975
## StateTexas -1.098e-01 8.562e-01 -0.128 0.8980
## StateVermont -1.653e+00 2.395e+00 -0.690 0.4907
## StateVirginia -6.829e-01 1.150e+00 -0.594 0.5532
## StateWashington 8.993e-02 1.753e+00 0.051 0.9591
## StateWest Virginia -7.573e-01 8.890e-01 -0.852 0.3951
## StateWisconsin 4.020e-01 1.478e+00 0.272 0.7859
## StateWyoming -2.834e+00 2.434e+00 -1.165 0.2452
## TypeEarthquake 1.446e+00 2.973e+00 0.486 0.6272
## TypeFire 1.291e+00 2.933e+00 0.440 0.6603
## TypeFlood -2.039e-01 2.692e+00 -0.076 0.9397
## TypeHurricane -4.446e-01 2.707e+00 -0.164 0.8697
## TypeOther 3.339e+00 3.645e+00 0.916 0.3606
## TypeSevere Ice Storm -9.198e-01 3.163e+00 -0.291 0.7715
## TypeSevere Storm -5.954e-01 2.674e+00 -0.223 0.8240
## TypeTornado -2.885e+00 2.785e+00 -1.036 0.3014
## TypeVolcanic Eruption 7.260e+00 3.827e+00 1.897 0.0590 .
## Population 1.587e-07 2.442e-07 0.650 0.5165
## Duration 1.327e-02 6.997e-03 1.897 0.0590 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.302 on 252 degrees of freedom
## Multiple R-squared: 0.2769, Adjusted R-squared: 0.1105
## F-statistic: 1.664 on 58 and 252 DF, p-value: 0.004167
drop1(model.full, test = 'F')
model.right <- lm(log.damage ~ log.income + Type + Duration, data = na.omit(df.model))
summary(model.right)
##
## Call:
## lm(formula = log.damage ~ log.income + Type + Duration, data = na.omit(df.model))
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.960 -1.570 -0.078 1.551 5.864
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.750439 5.472425 0.320 0.7493
## log.income 0.837698 0.497469 1.684 0.0932 .
## TypeEarthquake 1.007025 2.474533 0.407 0.6843
## TypeFire 1.504898 2.482205 0.606 0.5448
## TypeFlood -0.929765 2.316326 -0.401 0.6884
## TypeHurricane -0.813160 2.297781 -0.354 0.7237
## TypeOther 3.478853 3.221237 1.080 0.2810
## TypeSevere Ice Storm -1.994636 2.789519 -0.715 0.4751
## TypeSevere Storm -1.207145 2.284732 -0.528 0.5976
## TypeTornado -3.328925 2.388800 -1.394 0.1645
## TypeVolcanic Eruption 6.932481 3.280384 2.113 0.0354 *
## Duration 0.009687 0.006022 1.609 0.1088
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.278 on 299 degrees of freedom
## Multiple R-squared: 0.1602, Adjusted R-squared: 0.1293
## F-statistic: 5.184 on 11 and 299 DF, p-value: 1.822e-07
fitted.model <- fitted(model)
str(fitted.model)
## Named num [1:311] 9.16 9.4 9.34 8.81 9.94 ...
## - attr(*, "names")= chr [1:311] "1" "2" "3" "4" ...
plot(log.damage ~ log.income, data = df.model,
main = "Model 'model'",
col = "pink4")
# points(fitted.model ~ log.income,
# col = "lightblue4",
# pch = 19,
# data = df.model)
# abline(model, col = "red4")
df.sub <- df.model[which(!is.na(fitted.model)), ]
# Create plot with both observed and fitted values
plot(log.damage ~ log.income, data = df.sub,
main = "Model 'model'",
col = "pink4")
points(fitted.model ~ log.income,
col = "lightblue4",
pch = 19,
data = df.sub)
abline(model, col = "red4")
# df.plot <- data.frame(log.damage = df.model$log.damage, log.income = df.model$log.income, fitted = fitted.model)
#
# # Plot the observed and fitted values
# plot(log.damage ~ log.income, data = df.plot, main = "Model 'model'", col = "pink4")
# points(fitted ~ log.income, data = df.plot, col = "lightblue4", pch = 19)
# abline(model, col = "red4")
For the chapter of choice, we chose the Shiny dashboard application. The focus lies on exploratory analysis and on making it interactive for users.
For any analysis, some form of outlier detection helps to see whether the data has structure. Here, this works by taking several variables and forming groups from them. The goal is to have large differences between the formed groups, while the differences within a group should be small.
#Dataframe
df.outlier.0 <- data.frame(Approved = df.prep.5$Approved,
Repair = df.prep.5$Repair,
income = df.prep.5$IncomeCapita,
population = df.prep.5$Population,
damage = df.prep.5$Damage)
df.outlier.1 <- data.frame(repair = df.outlier.0$Repair,
income = df.outlier.0$income)
plot(df.outlier.1)
When we scale the data, we get one big cluster. Examining the initial clustering, we can see that the elbow is at 2 or 5. Further analysis with cluster plots and a silhouette plot indicates that two clusters are the better solution. This means that almost all observations fall into one large cluster, with only a few extreme data points forming the second.
#prep the data
datas <- scale(df.outlier.1, center = FALSE, scale = TRUE)
datas <- na.omit(datas)
boxplot(datas)
plot(datas)
dists <- dist(datas)
km <- kmeans(datas, centers = 3, nstart = 10)
groups_km <- km$cluster
groups_km
## [1] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## [38] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## [75] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## [112] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## [149] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## [186] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## [223] 3 3 3 3 3 3 3 3 3 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## [260] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 1 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## [297] 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## [334] 3 3 3 3 3 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## [371] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## [408] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## [445] 3
cluster_size <- cbind(sum(groups_km == 1), sum(groups_km == 2),
sum(groups_km == 3))
cluster_size
## [,1] [,2] [,3]
## [1,] 1 4 440
plot(datas, pch = groups_km, col=groups_km, lwd=2)
legend("topright", legend = 1:3, pch = 1:3, col=1:3, bty="n")
reps <- rep(0, 6)
for (i in 1:6) reps[i] <- sum(kmeans(datas, centers = i, nstart = 20)$withinss)
par(mfrow = c(1,1))
plot(1:6, reps, type = "b", xlab = "Number of groups", ylab = "Sum of squares")
text(3, 120, "Elbow point signifies how many \ngroups there could be", col = "red", cex = 1.08, pos = 4)
km2 <- kmeans(datas, centers = 2, nstart = 10)
groups_km2 <- km2$cluster
groups_km2
## [1] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [38] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [75] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [112] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [149] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [186] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [223] 2 2 2 2 2 2 2 2 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [260] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [297] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [334] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [371] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [408] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [445] 2
cluster_size2 <- cbind(sum(groups_km2 == 1), sum(groups_km2 == 2))
cluster_size2
## [,1] [,2]
## [1,] 2 443
plot(datas, pch = groups_km2, col=groups_km2, lwd=2)
legend("topright", legend = 1:2, pch = 1:2, col=1:2, bty="n")
km5 <- kmeans(datas, centers = 5, nstart = 10)
groups_km5 <- km5$cluster
groups_km5
## [1] 2 2 2 2 2 5 3 5 2 2 2 2 5 5 2 5 2 5 2 2 2 2 2 2 2 2 2 2 5 5 5 2 5 2 2 2 2
## [38] 2 2 5 5 2 2 2 5 5 2 2 2 2 5 5 2 2 5 2 2 2 5 3 5 5 5 2 2 2 5 2 2 5 5 2 5 2
## [75] 5 2 2 5 2 2 2 2 2 5 5 2 5 2 5 2 2 5 2 2 2 2 5 2 2 5 5 5 2 2 2 5 2 5 5 5 5
## [112] 2 2 2 5 2 5 5 5 2 2 5 5 5 5 2 2 5 5 2 2 5 2 2 2 5 2 2 2 5 5 5 5 2 5 5 2 2
## [149] 5 2 2 5 2 2 5 2 2 2 2 2 2 2 5 5 5 2 2 5 2 5 2 2 5 2 2 2 2 2 2 5 2 2 5 5 5
## [186] 5 2 2 2 3 5 2 2 2 2 2 2 2 2 5 2 2 2 2 2 2 2 2 2 2 5 2 2 2 2 5 5 2 2 5 5 2
## [223] 2 2 2 2 2 5 2 5 5 1 5 5 5 5 2 5 2 2 2 5 2 2 2 2 5 2 2 2 2 2 5 2 5 2 2 2 5
## [260] 2 2 5 5 2 5 5 5 2 5 2 5 5 2 2 2 4 5 2 2 5 2 2 2 2 2 2 2 2 5 2 2 2 2 2 2 2
## [297] 3 2 2 2 2 2 2 5 2 2 2 2 2 2 2 5 2 2 2 5 2 2 2 2 2 5 2 2 2 2 2 2 2 2 2 2 2
## [334] 2 2 2 2 5 3 5 2 2 2 2 2 5 2 2 5 5 5 2 2 2 5 5 5 2 2 5 5 5 2 5 5 3 2 2 2 2
## [371] 5 2 2 2 5 5 5 5 5 5 2 5 5 5 2 5 2 2 5 5 5 5 2 5 2 2 5 5 2 2 2 2 2 2 2 2 2
## [408] 5 2 2 5 2 2 5 2 5 5 2 2 5 5 2 2 2 2 2 2 2 2 2 2 2 2 5 2 2 2 2 5 2 2 2 2 2
## [445] 2
cluster_size5 <- cbind(sum(groups_km5 == 1), sum(groups_km5 == 2),
sum(groups_km5 == 3), sum(groups_km5 == 4),sum(groups_km5 == 5))
cluster_size5
## [,1] [,2] [,3] [,4] [,5]
## [1,] 1 288 6 1 149
plot(datas, pch = groups_km5, col=groups_km5, lwd=2)
legend("topright", legend = 1:5, pch = 1:5, col=1:5, bty="n")
library(cluster)  # provides silhouette()
plot(silhouette(groups_km5, dists))
To make up for the time I missed during the block week on Monday afternoon, I was given the task of producing an additional chapter of choice with an R package that hadn’t been used before. Since we have date values in our data set, I thought it would be interesting to try to show a forecast. I started by generating a new data frame containing the frequency with which “severe storms” happened per month between 2002 and 2023. Since (luckily!) such a disaster did not happen every month, I complemented the data frame by adding every month in the given time period and filling in “0” if there was no event in that month. To generate the time series plot, I loaded the packages “forecast” and “zoo”, because R converted the date column into a different index before using it for the x-axis. The zoo package allowed us to specify the original date column as the index for the x-axis.
library(forecast)
library(zoo)
Sstorms <- subset(df.prep.5, Type == "Severe Storm")
Sstorms_timeonly <- Sstorms[,c("Disaster#", "DisasterBegin")]
#calculating the frequency of severe storms per month
Sstorms_month_year <- format(Sstorms_timeonly$DisasterBegin, "%Y-%m")
freq_table <- table(Sstorms_month_year)
#generating the data frame
frequency <- data.frame(Sstorms_month_year = names(freq_table), frequency = as.numeric(freq_table))
#converting the date format to a format R can recognize for the time series plot
frequency$Sstorms_month_year <- as.Date(paste0(frequency$Sstorms_month_year, "-01"), format = "%Y-%m-%d")
##filling in all months without a severe storm happening and adding 0 values
# Create a sequence of dates covering the entire range
start_date <- min(frequency$Sstorms_month_year)
end_date <- max(frequency$Sstorms_month_year)
all_dates <- seq(start_date, end_date, by = "month")
# Create a new data frame with missing months filled with 0 values
dates_df <- data.frame(date = all_dates)
colnames(frequency) <- c("date", "frequency")
# Merge the new data frame with the original data frame
dates_merged <- merge(frequency, dates_df, by = "date", all = TRUE)
# Replace missing values with 0
dates_merged[is.na(dates_merged$frequency), "frequency"] <- 0
# R did not use the date column for the x-axis but converted it into other
# indexes first. Therefore, we use the zoo package to create the time series
# object and specify the date column as the index
frequency_tszoo <- zoo(dates_merged$frequency, order.by = dates_merged$date)
# Plot the time series with the original dates as the x-axis
plot(frequency_tszoo, type = "l", xlab = "Date", ylab = "Frequency", main = "Frequency of severe storms / month")
The time series plot indicates a trend toward a decreasing frequency of severe storms in later years, but no seasonal component can be clearly identified. Trying to decompose the data anyway was therefore not successful and resulted in the error message that the time series is not periodic or has fewer than two periods.
#stl <- stl(frequency_tszoo, na.action = "omit", s.window = 12)
In another attempt, I wanted to look at the autocorrelation and partial autocorrelation functions to get an indication of whether it could make sense to fit a linear model such as an auto-regressive model to the data for the forecast. The acf resulted in the plot below, whereas the pacf resulted in an error message that the maximal lag must be at least one.
acf(frequency_tszoo, na.action = na.pass)
#library(ggfortify)
#ggPacf(frequency_tszoo)
By looking at the time series plot, I would also suggest that it’s quite hard to do an accurate forecast since the frequency dropped considerably in the past few years. Therefore, I suggest we would need to get additional data to confirm that this is a long-lasting drop which we could base the forecast on.
PS: Just to try it out, I fitted an AR(1) model to the data as a whole and predicted the next 100 data points, which resulted in a nonsensical horizontal line:
f_fit <- arima(frequency_tszoo, order = c(1,0,0))
forecast <- predict(f_fit, n.ahead = 100)
plot(frequency_tszoo,
xlim = c(18000, 19737),
xlab = "Time",
ylab = "Frequency")
lines(forecast$pred, lwd = 2, col = "red")
In this project, we focused on exploratory data analysis and different ways of displaying data. As is often the case for amounts, count data, etc., our variables, such as monetary amounts, tended to be right skewed, which is why we mostly worked with log-transformed data. The log transformation helps, for example, to generate figures that are easier to interpret when it comes to detecting differences. However, the caveat is that the “real numbers” cannot be read directly from the graph. Furthermore, the log transformation “shifts” right-skewed data toward a more Gaussian distribution, which is important when the goal is to use statistical analysis methods or machine learning models, since many of these methods assume an underlying Gaussian distribution of the data. To generate plots, we quickly came to like the ggplot package. The explanation in the block seminar that ggplot uses a layered approach was very helpful and made using it and starting to play around much easier and more fun. The session on R Markdown was also incredibly helpful, and overall, we are taking a lot of great tips and tricks with us that will surely make our daily coding life more enjoyable :-)